Casebook / Side-by-Side Outputs / Not Hidden In My Head

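A task-by-task, output-by-output comparison, so the evaluation is no longer just "I think so".

This page lays out every downstream task of every brief: what the prompt is, why it is tested, what is expected, what each of the four answers looks like, and the average score each version earned.

If you want to check whether the ranking is well grounded, this is the key evidence page.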

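How To Read: look at the Avg score. Each version's score on each task is the mean semantic score over 3 repeated runs, not the luck of a single run.
What Is Shown: look at the Repeat 1 output. To keep the page readable, it shows one representative output; the overall totals are still based on the mean of the 3 repeats.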

Method Reminder

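Why the comparison here holds up

Because every version gets the same brief, the same task, the same representative query, and the same scoring logic. This is not a comparison of writing flair; it compares which creator's generated skill better locks the model onto the correct track under the same task pressure.

  • Same brief JSON: the task description, artifact hint, downstream tasks, and trigger query are all identical.
  • Same Codex / same model / same machine / same runtime.
  • Each task is judged on the mean of 3 repeats, so one chance fluctuation is not treated as a conclusion.
  • The page shows the repeat 1 output, but the score shown is the mean across repeats.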
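
The averaging described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function names are hypothetical, and the per-repeat scores are made-up values chosen only because they are consistent with averages shown further down this page (93.33 for tt_export_blank and 94.67 for the Support Ticket Triager brief under My Skill Creator Copy). The page itself publishes only the averages.

```python
from statistics import mean


def task_average(repeat_scores):
    """Mean semantic score across repeated runs, rounded to 2 decimals."""
    return round(mean(repeat_scores), 2)


def brief_average(task_averages):
    """Mean of the per-task averages for one creator on one brief (assumed)."""
    return round(mean(task_averages), 2)


# Hypothetical per-repeat scores; only the resulting averages appear on the page.
print(task_average([80, 100, 100]))               # 93.33
print(brief_average([100, 100, 93.33, 100, 80]))  # 94.67
```

A single repeat at 80 pulls a task average down to 93.33, which is why the representative repeat 1 output can look wrong even when the average is high.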

Benchmark Brief

Conventional Commit

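What it tests: minimal, highly compressed output, that is, whether the creator can narrow the skill down to a "one-line commit", a target that is very easy to drift away from.

Why this brief is worth comparing: it is the best task type for testing "don't be verbose, don't over-expand, pin the boundaries down". As soon as a skill creator's structure is loose, it bloats the skill.

How it is scored: the format score checks single line, no code block, length, and the commit regex; the semantic score checks whether the type, the breaking flag, and the key nouns are preserved.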
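
A rough sketch of what such checks could look like, assuming a conventional `type(scope)!: subject` shape. The regex, the 100-character limit, and the case-insensitive substring match for keywords are all assumptions; the page does not publish the real scorer. The keyword assumption is at least consistent with cc_fix, where an output containing `WEBHOOK_SECRET` is credited for the keywords "webhook" and "secret".

```python
import re

# Assumed Conventional Commit shape: type(scope)!: subject
COMMIT_RE = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<subject>\S.*)$"
)
MAX_LEN = 100    # hypothetical limit; the page does not state one
FENCE = "`" * 3  # marker of a fenced code block


def format_ok(message):
    """Single line, no code fence, within the length limit, matches the regex."""
    text = message.strip()
    return (
        "\n" not in text
        and FENCE not in text
        and len(text) <= MAX_LEN
        and COMMIT_RE.match(text) is not None
    )


def semantic_checks(message, target):
    """Compare type, breaking marker, and keywords against a rubric target."""
    text = message.strip()
    match = COMMIT_RE.match(text)
    lowered = text.lower()
    return {
        "type": bool(match) and match.group("type") == target["type"],
        "breaking": bool(match)
        and (match.group("breaking") is not None) == target["breaking"],
        "keywords": all(k in lowered for k in target["required_keywords"]),
    }


target = {"type": "feat", "breaking": True, "required_keywords": ["workspace", "sync"]}
print(format_ok("feat(sync)!: replace --team with --workspace"))
print(semantic_checks("feat(sync)!: replace --team with --workspace", target))
```

How the three semantic checks are weighted into a 0-100 score is not stated on the page, so the sketch stops at the individual checks.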

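Official Original

Average semantic accuracy: 100

Average creation time: 78.63s

Average creation tokens: 107,390

My Skill Creator Copy

Average semantic accuracy: 100

Average creation time: 74.23s

Average creation tokens: 103,860

Dazhuang Skill Creator

Average semantic accuracy: 100

Average creation time: 70.27s

Average creation tokens: 91,593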

Task cc_docs

Write the commit message for this change: fix a typo in the README install curl snippet.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "type": "docs", "breaking": false, "required_keywords": [ "readme", "install" ] }

Baseline Avg 100
docs(readme): fix typo in install curl snippet
Official Original Avg 100
docs(readme): fix install curl snippet typo
My Skill Creator Copy Avg 100
docs(readme): fix install curl snippet typo
Dazhuang Skill Creator Avg 100
docs(readme): fix typo in install curl snippet

Task cc_fix

Compress this into one Conventional Commit: avoid a startup crash when WEBHOOK_SECRET is missing.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "type": "fix", "breaking": false, "required_keywords": [ "webhook", "secret" ] }

Baseline Avg 100
fix(startup): avoid crash when WEBHOOK_SECRET is missing
Official Original Avg 100
fix: avoid startup crash when webhook secret is missing
My Skill Creator Copy Avg 100
fix: avoid startup crash when webhook secret is missing
Dazhuang Skill Creator Avg 100
fix: avoid startup crash when WEBHOOK_SECRET is missing

Task cc_perf

One-line commit for my diff: cache parsed config during boot and cut cold start time by 35%.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "type": "perf", "breaking": false, "required_keywords": [ "config", "boot" ] }

Baseline Avg 100
perf(config): cache parsed config during boot to reduce cold start time by 35%
Official Original Avg 100
perf: cache parsed config during boot
My Skill Creator Copy Avg 100
perf: cache parsed config during boot
Dazhuang Skill Creator Avg 100
perf: cache parsed config during boot

Task cc_breaking

Should this be feat or fix commit? Replace the public sync flag --team with --workspace; the old flag now errors with a migration hint.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "type": "feat", "breaking": true, "required_keywords": [ "workspace", "sync" ] }

Baseline Avg 100
feat!: replace public sync flag --team with --workspace
Official Original Avg 100
feat(sync)!: replace --team with --workspace
My Skill Creator Copy Avg 100
feat(sync)!: replace --team with --workspace
Dazhuang Skill Creator Avg 100
feat(sync)!: replace --team with --workspace

Task cc_feat

Write the commit message: add a --dry-run option to the sync command.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "type": "feat", "breaking": false, "required_keywords": [ "dry-run", "sync" ] }

Baseline Avg 100
feat(sync): add --dry-run option
Official Original Avg 100
feat(sync): add --dry-run option
My Skill Creator Copy Avg 100
feat(sync): add --dry-run option
Dazhuang Skill Creator Avg 100
feat(sync): add --dry-run option

Benchmark Brief

Support Ticket Triager

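What it tests: strict JSON classification, expressing a label taxonomy, configurable structure, and the creator's judgment about schema / config / interface metadata.

Why this brief is worth comparing: this kind of task is close to real work. It needs accurate classification, stable adherence to the output structure, and some thought about whether the labels will change later.

How it is scored: the format score checks whether the output parses as JSON and has every key; the semantic score checks the five fields severity, product_area, issue_type, customer_tier, and needs_human_reply one by one.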
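
A minimal sketch of this rule, assuming each of the five fields carries equal weight (20 points) and that an output failing the format check earns no semantic credit. Both are assumptions: they are consistent with the 80 and 93.33 averages shown below, but the page does not spell them out. The function name `score_triage` is hypothetical.

```python
import json

FIELDS = ("severity", "product_area", "issue_type", "customer_tier", "needs_human_reply")


def score_triage(raw_output, expected):
    """Return (format_ok, semantic_score) for one triage output."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, 0.0
    if not isinstance(parsed, dict) or any(f not in parsed for f in FIELDS):
        return False, 0.0
    # Assumed equal weighting: each matching field is worth 20 points.
    hits = sum(parsed[f] == expected[f] for f in FIELDS)
    return True, 100.0 * hits / len(FIELDS)


expected = {"severity": "sev2", "product_area": "export", "issue_type": "bug",
            "customer_tier": "pro", "needs_human_reply": True}
output = ('{"severity":"sev2","product_area":"export","issue_type":"data_issue",'
          '"customer_tier":"pro","needs_human_reply":true}')
print(score_triage(output, expected))  # (True, 80.0)
```

The example output is the repeat 1 answer shown for tt_export_blank under My Skill Creator Copy: one wrong field out of five gives 80 for that repeat under this weighting.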

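Official Original

Average semantic accuracy: 100

Average creation time: 120.85s

Average creation tokens: 174,391

My Skill Creator Copy

Average semantic accuracy: 94.67

Average creation time: 98.45s

Average creation tokens: 113,622

Dazhuang Skill Creator

Average semantic accuracy: 100

Average creation time: 139.72s

Average creation tokens: 264,397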

Task tt_sso

Please triage this support ticket: Tier: enterprise Ticket: Since yesterday's SSO cutover, nobody at Acme can log in. About 140 seats are blocked and payroll closes in 3 hours.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "severity": "sev1", "product_area": "auth", "issue_type": "outage", "customer_tier": "enterprise", "needs_human_reply": true }

Baseline Avg 0
Priority: Critical Category: Authentication / SSO outage Impact: High — org-wide login failure affecting about 140 users, time-sensitive payroll risk in 3 hours Route to: Enterprise support + Identity/SSO engineering immediately Severity: Sev-1 Suggested handling: Treat as a production outage, open an incident now, and contact the customer on the urgent channel while investigating the SSO cutover.
Official Original Avg 100
{"severity":"sev1","product_area":"auth","issue_type":"outage","customer_tier":"enterprise","needs_human_reply":true}
My Skill Creator Copy Avg 100
{"severity":"sev1","product_area":"auth","issue_type":"outage","customer_tier":"enterprise","needs_human_reply":true}
Dazhuang Skill Creator Avg 100
{"severity":"sev1","product_area":"auth","issue_type":"outage","customer_tier":"enterprise","needs_human_reply":true}

Task tt_dark_mode

Triage this customer ticket: Tier: free Ticket: Love the product. Can you add dark mode to the dashboard? The white UI is harsh at night.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "severity": "sev3", "product_area": "ui", "issue_type": "feature_request", "customer_tier": "free", "needs_human_reply": true }

Baseline Avg 0
Priority: Low Category: Feature request / UI theming Impact: Low — no breakage, usability preference Route to: Product feedback backlog Severity: Sev-4 Suggested handling: Acknowledge the request, note dashboard dark mode demand, and add it to the feature-request queue for prioritization.
Official Original Avg 100
{"severity":"sev3","product_area":"ui","issue_type":"feature_request","customer_tier":"free","needs_human_reply":true}
My Skill Creator Copy Avg 100
{"severity":"sev3","product_area":"ui","issue_type":"feature_request","customer_tier":"free","needs_human_reply":true}
Dazhuang Skill Creator Avg 100
{"severity":"sev3","product_area":"ui","issue_type":"feature_request","customer_tier":"free","needs_human_reply":true}

Task tt_export_blank

Classify this ticket for the queue: Tier: pro Ticket: CSV export downloads a blank file for March invoices. I tried Chrome and Safari and both are blank.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "severity": "sev2", "product_area": "export", "issue_type": "bug", "customer_tier": "pro", "needs_human_reply": true }

Baseline Avg 0
Priority: High Category: Billing / Export bug Impact: Medium-High — invoice CSV export for March is unusable across multiple browsers, likely not client-specific Route to: Pro support + Billing/Reporting engineering Severity: Sev-2 Suggested handling: Triage as a reproducible product defect affecting financial exports and investigate March invoice export generation.
Official Original Avg 100
{"severity":"sev2","product_area":"export","issue_type":"bug","customer_tier":"pro","needs_human_reply":true}
My Skill Creator Copy Avg 93.33
{"severity":"sev2","product_area":"export","issue_type":"data_issue","customer_tier":"pro","needs_human_reply":true}
Dazhuang Skill Creator Avg 100
{"severity":"sev2","product_area":"export","issue_type":"bug","customer_tier":"pro","needs_human_reply":true}

Task tt_invoice_question

Please triage: Tier: business Ticket: Where do I change the email address that receives our invoices?
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "severity": "sev3", "product_area": "billing", "issue_type": "question", "customer_tier": "business", "needs_human_reply": true }

Baseline Avg 0
Priority: Normal Category: Billing / Account configuration question Impact: Low — how-to request, no outage Route to: Billing support Severity: Sev-4 Suggested handling: Respond with instructions for updating the billing contact email in account billing settings, and verify required admin permissions if needed.
Official Original Avg 100
{"severity":"sev3","product_area":"billing","issue_type":"question","customer_tier":"business","needs_human_reply":true}
My Skill Creator Copy Avg 100
{"severity":"sev3","product_area":"billing","issue_type":"question","customer_tier":"business","needs_human_reply":true}
Dazhuang Skill Creator Avg 100
{"severity":"sev3","product_area":"billing","issue_type":"question","customer_tier":"business","needs_human_reply":true}

Task tt_duplicate_events

Queue this support ticket: Tier: enterprise Ticket: Your webhook system delivered duplicate events for about two hours this morning and our downstream system created duplicate shipments.
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "severity": "sev1", "product_area": "api", "issue_type": "data_issue", "customer_tier": "enterprise", "needs_human_reply": true }

Baseline Avg 0
Priority: Critical Category: Webhooks / Duplicate event delivery Impact: High — customer reports downstream duplicate shipments caused by duplicate webhook deliveries for about two hours Route to: Enterprise support + Integrations/Webhooks engineering immediately Severity: Sev-1 Suggested handling: Open an incident, investigate delivery logs and retry behavior for the affected window, and coordinate customer mitigation guidance for duplicate processing.
Official Original Avg 100
{"severity":"sev1","product_area":"api","issue_type":"data_issue","customer_tier":"enterprise","needs_human_reply":true}
My Skill Creator Copy Avg 80
{"severity":"sev2","product_area":"api","issue_type":"data_issue","customer_tier":"enterprise","needs_human_reply":true}
Dazhuang Skill Creator Avg 100
{"severity":"sev1","product_area":"api","issue_type":"data_issue","customer_tier":"enterprise","needs_human_reply":true}

Benchmark Brief

Command Risk Checker

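What it tests: a safety-judgment skill, that is, whether the creator can make the result short, hard-constrained, and able to reliably tell safe / confirm / refuse apart.

Why this brief is worth comparing: it is a typical judgment-type skill, well suited to testing whether the creator writes rules that are scattered, soft, or not executable enough.

How it is scored: the format score checks that the output is JSON with all keys present; the semantic score gives 80% to verdict and 20% to whether safer_alternative appears as expected.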
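
The 80/20 split can be sketched like this. The key names come from the outputs shown on this page; treating "appears as expected" as "non-empty exactly when the rubric's needs_alternative is true" is an assumption, as is giving no semantic credit when the format check fails. `score_risk_check` is a hypothetical name.

```python
import json

KEYS = ("verdict", "reason", "safer_alternative")


def score_risk_check(raw_output, expected):
    """Return (format_ok, semantic_score) for one risk-check output."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, 0.0
    if not isinstance(parsed, dict) or any(k not in parsed for k in KEYS):
        return False, 0.0
    verdict_hit = parsed["verdict"] == expected["verdict"]
    # Assumed reading: an alternative "appears" when the string is non-empty.
    has_alternative = bool(str(parsed["safer_alternative"]).strip())
    alternative_hit = has_alternative == expected["needs_alternative"]
    return True, 80.0 * verdict_hit + 20.0 * alternative_hit


output = ('{"verdict":"safe","reason":"Read-only directory listing of a specific path.",'
          '"safer_alternative":""}')
print(score_risk_check(output, {"verdict": "safe", "needs_alternative": False}))  # (True, 100.0)
```

Under this reading, a correct verdict with a missing or unwanted alternative would score 80, and a wrong verdict with the right alternative behavior would score 20.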

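Official Original

Average semantic accuracy: 100

Average creation time: 101.76s

Average creation tokens: 127,341

My Skill Creator Copy

Average semantic accuracy: 100

Average creation time: 95.68s

Average creation tokens: 131,832

Dazhuang Skill Creator

Average semantic accuracy: 100

Average creation time: 84.57s

Average creation tokens: 142,980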

Task cr_ls

Risk-check this command before I run it: ls -la /var/log/myapp
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "verdict": "safe", "needs_alternative": false }

Baseline Avg 0
Low risk if you trust the path. `ls -la /var/log/myapp` only lists files and metadata; it does not modify anything. Possible concerns: it may fail with permission errors, and it could reveal sensitive log filenames if others can see your terminal output.
Official Original Avg 100
{"verdict":"safe","reason":"Read-only directory listing of a specific path.","safer_alternative":""}
My Skill Creator Copy Avg 100
{"verdict":"safe","reason":"This is a read-only directory listing with low risk.","safer_alternative":""}
Dazhuang Skill Creator Avg 100
{"verdict":"safe","reason":"Read-only directory listing of a log path.","safer_alternative":""}

Task cr_reset

Is this command safe? git reset --hard HEAD~3
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "verdict": "confirm", "needs_alternative": true }

Baseline Avg 0
Not safe if you might need those last 3 commits or any uncommitted changes. `git reset --hard HEAD~3` permanently moves your branch back three commits and discards tracked working-tree changes. Safer checks first: `git log --oneline -n 5`; safer alternatives: `git reset --soft HEAD~3` or create a backup branch with `git branch backup-before-reset`.
Official Original Avg 100
{"verdict":"confirm","reason":"Irreversibly rewrites git history and discards local changes.","safer_alternative":"git log --oneline HEAD~3..HEAD && git branch backup-before-reset && git reset --hard HEAD~3"}
My Skill Creator Copy Avg 100
{"verdict":"confirm","reason":"This rewrites Git history and discards uncommitted changes.","safer_alternative":"git status && git log --oneline -3"}
Dazhuang Skill Creator Avg 100
{"verdict":"confirm","reason":"Rewrites local git history and discards working tree changes.","safer_alternative":"git status && git log --oneline -n 5"}

Task cr_rm_root

Quick safety check: rm -rf /
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "verdict": "refuse", "needs_alternative": true }

Baseline Avg 0
No. `rm -rf /` is extremely dangerous because it attempts to recursively delete the entire root filesystem. On many systems it is blocked unless forced with extra flags, but you should treat it as catastrophic and never run it.
Official Original Avg 100
{"verdict":"refuse","reason":"Irreversible root-level filesystem wipe.","safer_alternative":"ls /"}
My Skill Creator Copy Avg 100
{"verdict":"refuse","reason":"This attempts broad filesystem deletion at the root and is catastrophically destructive.","safer_alternative":"ls /"}
Dazhuang Skill Creator Avg 100
{"verdict":"refuse","reason":"Recursively wipes the root filesystem.","safer_alternative":"Inspect the specific path first and delete only that target."}

Task cr_k8s_prod

Risk-check this: kubectl delete namespace prod
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "verdict": "refuse", "needs_alternative": true }

Baseline Avg 0
Very high risk. `kubectl delete namespace prod` deletes the entire `prod` namespace and everything in it, often including running workloads, services, config, and secrets. Only run it if you explicitly intend to destroy that environment. Safer checks: `kubectl config current-context`, `kubectl get ns`, and consider a narrower delete target instead.
Official Original Avg 100
{"verdict":"refuse","reason":"Deletes a clearly production Kubernetes namespace.","safer_alternative":"kubectl get namespace prod"}
My Skill Creator Copy Avg 100
{"verdict":"refuse","reason":"This deletes a production-like namespace and can remove live resources.","safer_alternative":"kubectl get namespace prod"}
Dazhuang Skill Creator Avg 100
{"verdict":"refuse","reason":"Deletes a production namespace and all contained resources.","safer_alternative":"kubectl get namespace prod && kubectl delete namespace <non-prod> --dry-run=server"}

Task cr_mkdir_copy

Is this command safe to run? mkdir -p tmp/build-cache && cp artifact.tgz tmp/build-cache/
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "verdict": "safe", "needs_alternative": false }

Baseline Avg 0
Generally safe. `mkdir -p tmp/build-cache && cp artifact.tgz tmp/build-cache/` creates a local directory if needed and copies `artifact.tgz` into it. Main risks: `artifact.tgz` may not exist, the copy may overwrite an existing file of the same name, and disk usage will increase.
Official Original Avg 100
{"verdict":"safe","reason":"Creates a local scratch folder and copies into it.","safer_alternative":""}
My Skill Creator Copy Avg 100
{"verdict":"safe","reason":"This only creates a local directory and copies a file into it.","safer_alternative":""}
Dazhuang Skill Creator Avg 100
{"verdict":"safe","reason":"Creates a local cache directory and copies one file into it.","safer_alternative":""}

Task cr_delete_pyc

Safety-check this command: find . -name "*.pyc" -delete
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "verdict": "confirm", "needs_alternative": true }

Baseline Avg 0
Usually safe, but use care. `find . -name "*.pyc" -delete` removes all Python bytecode files under the current directory tree. It is often fine for cleanup because `.pyc` files are regenerated, but the risk is scope: if you run it from the wrong directory, it will delete more widely than intended. Safer preview: `find . -name "*.pyc"`.
Official Original Avg 100
{"verdict":"confirm","reason":"Broad delete pattern removes many files irreversibly.","safer_alternative":"find . -name \"*.pyc\" -print"}
My Skill Creator Copy Avg 100
{"verdict":"confirm","reason":"This deletes matched files under the current directory tree.","safer_alternative":"find . -name \"*.pyc\" -print"}
Dazhuang Skill Creator Avg 100
{"verdict":"confirm","reason":"Broad file deletion depends on the current directory scope.","safer_alternative":"find . -name \"*.pyc\" -print"}

Benchmark Brief

Weekly Release Notes

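What it tests: templated markdown summarization, awareness of assets and templates, and whether the creator moves the fixed skeleton to a more suitable place.

Why this brief is worth comparing: this kind of task is neither pure classification nor pure free writing; it is a realistic collaborative task of "fixed skeleton + sorting items into categories".

How it is scored: the format score checks heading order, section completeness, and bullet discipline; the semantic score checks whether key tokens land in the correct section and whether empty sections are written as `- None`.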
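
The format side of this rule could look roughly like the sketch below. The heading order Added / Fixed / Changed / Docs is taken from the outputs and rubric targets on this page; the exact meaning of "bullet discipline" (here: every section has at least one line and every line starts with `- `) is an assumption, and the function names are hypothetical. The example input assumes the original outputs had one heading or bullet per line, since the page shows them flattened.

```python
SECTIONS = ["Added", "Fixed", "Changed", "Docs"]


def parse_sections(markdown):
    """Split markdown into (ordered headings, {heading: body lines})."""
    order, bodies, current = [], {}, None
    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("## "):
            current = line[3:].strip()
            order.append(current)
            bodies[current] = []
        elif current is not None:
            bodies[current].append(line)
        else:
            bodies.setdefault("", []).append(line)  # text before any heading
    return order, bodies


def format_ok(markdown):
    """Headings in the fixed order, no stray text, only bullets in each section."""
    order, bodies = parse_sections(markdown)
    if order != SECTIONS or "" in bodies:
        return False
    return all(
        bodies[name] and all(item.startswith("- ") for item in bodies[name])
        for name in SECTIONS
    )


def empty_sections_marked(markdown, expected):
    """Sections with no expected items must contain exactly '- None'."""
    _, bodies = parse_sections(markdown)
    return all(
        bodies.get(name) == ["- None"]
        for name, groups in expected.items()
        if not groups
    )


notes = """## Added
- None
## Fixed
- None
## Changed
- None
## Docs
- Clarified the SSO setup steps.
- Added API retry examples.
"""
expected = {"Added": [], "Fixed": [], "Changed": [], "Docs": [["sso", "setup"], ["api", "retry"]]}
print(format_ok(notes), empty_sections_marked(notes, expected))  # True True
```

A one-paragraph prose summary, like the Baseline outputs in this brief, has no headings at all and fails `format_ok` immediately.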

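Official Original

Average semantic accuracy: 100

Average creation time: 98.61s

Average creation tokens: 132,387

My Skill Creator Copy

Average semantic accuracy: 100

Average creation time: 72.89s

Average creation tokens: 95,011

Dazhuang Skill Creator

Average semantic accuracy: 100

Average creation time: 77.82s

Average creation tokens: 129,896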

Task rn_mix_a

Turn these merged PR notes into the weekly release note: - Added CSV export for payouts - Fixed OAuth token refresh loop - Docs: new self-hosting guide
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "Added": [ [ "csv", "export" ] ], "Fixed": [ [ "oauth", "refresh" ] ], "Changed": [], "Docs": [ [ "self-hosting", "guide" ] ] }

Baseline Avg 0
This week’s release includes CSV export for payouts, a fix for an OAuth token refresh loop, and a new self-hosting guide in the docs.
Official Original Avg 100
## Added - Added CSV export for payouts. ## Fixed - Fixed the OAuth token refresh loop. ## Changed - None ## Docs - Added a new self-hosting guide.
My Skill Creator Copy Avg 100
## Added - Added CSV export for payouts. ## Fixed - Fixed an OAuth token refresh loop. ## Changed - None ## Docs - Added a new self-hosting guide.
Dazhuang Skill Creator Avg 100
## Added - Added CSV export for payouts ## Fixed - Fixed the OAuth token refresh loop ## Changed - None ## Docs - Added a new self-hosting guide

Task rn_mix_b

Please normalize these release notes: - Changed the billing page layout to group invoices by month - Fixed duplicate webhook deliveries - Added dry-run mode to sync command
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "Added": [ [ "dry-run", "sync" ] ], "Fixed": [ [ "duplicate", "webhook" ] ], "Changed": [ [ "billing", "layout" ] ], "Docs": [] }

Baseline Avg 0
This release updates the billing page to group invoices by month, fixes duplicate webhook deliveries, and adds a dry-run mode to the sync command.
Official Original Avg 100
## Added - Added a dry-run mode to the sync command. ## Fixed - Fixed duplicate webhook deliveries. ## Changed - Updated the billing page layout to group invoices by month. ## Docs - None
My Skill Creator Copy Avg 100
## Added - Added dry-run mode to the sync command. ## Fixed - Fixed duplicate webhook deliveries. ## Changed - Changed the billing page layout to group invoices by month. ## Docs - None
Dazhuang Skill Creator Avg 100
## Added - Added dry-run mode to the sync command ## Fixed - Fixed duplicate webhook deliveries ## Changed - Changed the billing page layout to group invoices by month ## Docs - None

Task rn_docs_only

Make the weekly release note from these merged items: - Docs: clarify SSO setup steps - Docs: add API retry examples
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "Added": [], "Fixed": [], "Changed": [], "Docs": [ [ "sso", "setup" ], [ "api", "retry" ] ] }

Baseline Avg 0
This week’s documentation updates clarify the SSO setup steps and add API retry examples.
Official Original Avg 100
## Added - None ## Fixed - None ## Changed - None ## Docs - Clarified the SSO setup steps. - Added API retry examples.
My Skill Creator Copy Avg 100
## Added - None ## Fixed - None ## Changed - None ## Docs - Clarified SSO setup steps. - Added API retry examples.
Dazhuang Skill Creator Avg 100
## Added - None ## Fixed - None ## Changed - None ## Docs - Clarified SSO setup steps - Added API retry examples

Benchmark Brief

Stakeholder Brief

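What it tests: extraction from "messy input -> fixed summary structure", especially how the creator divides the work among heuristics, templates, and the body.

Why this brief is worth comparing: this is the brief that best separates creator quality. The input is messy, there are many edge cases, and it is easy to summarize by feel, so it really tests the instruction architecture.

How it is scored: the format score checks the fixed headings and bullet structure; the semantic score checks whether the four kinds of information (Goal / Decisions / Open Questions / Next Steps) are extracted accurately.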
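
One way to read "extracted accurately" is literal token matching against the rubric target, sketched below. Everything about the matching is an assumption: the all-or-nothing credit per section, the equal weight of the four sections, the case-insensitive substring test, and the hypothetical function names. The example input assumes the original output had one heading or bullet per line.

```python
def section_text(markdown):
    """Map each '## Heading' to its lowercased body text."""
    sections, current = {}, None
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None and line:
            sections[current] += " " + line.lower()
    return sections


def section_hits(markdown, expected):
    """Assumed all-or-nothing rule: every token of every group must appear."""
    sections = section_text(markdown)
    hits = {}
    for name, groups in expected.items():
        body = sections.get(name, "")
        if groups:
            hits[name] = all(all(tok in body for tok in group) for group in groups)
        else:
            hits[name] = body.strip() == "- none"  # empty section marker
    return hits


def semantic_score(markdown, expected):
    """Share of sections that pass, scaled to 0-100 (assumed equal weights)."""
    hits = section_hits(markdown, expected)
    return 100.0 * sum(hits.values()) / len(hits)


expected = {
    "Goal": [["clearer", "export", "pro"]],
    "Decisions": [],
    "Open Questions": [["hard-cap", "queue", "exports"]],
    "Next Steps": [["priya", "5", "next week"]],
}
brief = """## Goal
- Clearer export limits for pro users
## Decisions
- None
## Open Questions
- Should exports hard-cap rows or queue large exports?
## Next Steps
- Priya interviews 5 pro customers next week
"""
print(semantic_score(brief, expected))  # 100.0
```

With literal matching, an output that writes "Clarify" instead of "clearer", or "five" instead of "5", fails that section even though the meaning is preserved. Whether the real scorer is this strict is not stated on the page.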

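Official Original

Average semantic accuracy: 86.11

Average creation time: 103.92s

Average creation tokens: 125,870

My Skill Creator Copy

Average semantic accuracy: 80.55

Average creation time: 95.45s

Average creation tokens: 132,610

Dazhuang Skill Creator

Average semantic accuracy: 100

Average creation time: 68.43s

Average creation tokens: 89,581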

Task ib_launch

Normalize these call notes into the standard brief: - Goal is to launch self-serve SSO in May - Decided to keep SCIM out of v1 - Open question: should audit logs be enterprise-only? - Next step: Sara drafts rollout plan by Tuesday
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "Goal": [ [ "launch", "sso", "may" ] ], "Decisions": [ [ "scim", "v1" ] ], "Open Questions": [ [ "audit", "logs", "enterprise" ] ], "Next Steps": [ [ "sara", "rollout", "tuesday" ] ] }

Baseline Avg 0
Goal: Launch self-serve SSO in May Decisions: - Keep SCIM out of v1 Open Questions: - Should audit logs be enterprise-only? Next Steps: - Sara to draft the rollout plan by Tuesday
Official Original Avg 100
## Goal - Launch self-serve SSO in May. ## Decisions - Keep SCIM out of v1. ## Open Questions - Decide whether audit logs should be enterprise-only. ## Next Steps - Sara drafts the rollout plan by Tuesday.
My Skill Creator Copy Avg 100
## Goal - Launch self-serve SSO in May. ## Decisions - SCIM is out of scope for v1. ## Open Questions - Should audit logs be limited to enterprise plans? ## Next Steps - Sara drafts the rollout plan by Tuesday.
Dazhuang Skill Creator Avg 100
## Goal - Launch self-serve SSO in May ## Decisions - Keep SCIM out of v1 ## Open Questions - Should audit logs be enterprise-only? ## Next Steps - Sara drafts the rollout plan by Tuesday

Task ib_pricing

Turn this stakeholder call dump into the standard brief: - We are trying to reduce enterprise sales friction - Decision: show annual pricing on the website - Decision: legal review not needed for the copy tweak - Next: Max updates the pricing page tomorrow
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "Goal": [ [ "reduce", "sales", "friction" ] ], "Decisions": [ [ "annual", "pricing" ], [ "legal", "review" ] ], "Open Questions": [], "Next Steps": [ [ "max", "pricing", "tomorrow" ] ] }

Baseline Avg 0
Goal: Reduce enterprise sales friction Decisions: - Show annual pricing on the website - Legal review is not needed for the copy tweak Next Steps: - Max to update the pricing page tomorrow
Official Original Avg 100
## Goal - Reduce enterprise sales friction. ## Decisions - Show annual pricing on the website. - Legal review is not needed for the copy tweak. ## Open Questions - None ## Next Steps - Max updates the pricing page tomorrow.
My Skill Creator Copy Avg 100
## Goal - Reduce friction in the enterprise sales process. ## Decisions - Show annual pricing on the website. - No legal review is needed for the pricing copy update. ## Open Questions - None ## Next Steps - Max updates the pricing page tomorrow.
Dazhuang Skill Creator Avg 100
## Goal - Reduce enterprise sales friction ## Decisions - Show annual pricing on the website - Legal review is not needed for the copy tweak ## Open Questions - None ## Next Steps - Max updates the pricing page tomorrow

Task ib_research

Please standardize these notes: - Want clearer export limits for pro users - Open question: should we hard-cap rows or queue large exports? - Next step: Priya interviews 5 pro customers next week
Representative output = repeat 1 / Score = avg across 3 repeats
Expected / rubric target { "Goal": [ [ "clearer", "export", "pro" ] ], "Decisions": [], "Open Questions": [ [ "hard-cap", "queue", "exports" ] ], "Next Steps": [ [ "priya", "5", "next week" ] ] }

Baseline Avg 0
Goal: Clarify export limits for pro users Open Questions: - Should exports have a hard row cap or queue large exports? Next Steps: - Priya to interview 5 pro customers next week
Official Original Avg 58.33
## Goal - Clarify export limits for pro users. ## Decisions - None ## Open Questions - Decide whether to hard-cap rows or queue large exports. ## Next Steps - Priya interviews five pro customers next week.
My Skill Creator Copy Avg 41.67
## Goal - Clarify export limits for pro users. ## Decisions - None ## Open Questions - Should large exports have a hard row cap or be handled through a queue? ## Next Steps - Priya interviews five pro customers next week.
Dazhuang Skill Creator Avg 100
## Goal - Clearer export limits for pro users ## Decisions - None ## Open Questions - Should exports hard-cap rows or queue large exports? ## Next Steps - Priya interviews 5 pro customers next week

File Paths

Related source files

If you want to dig deeper, you can go straight to this source data.